Abstract: Automatic facial expression analysis is an active area of research, especially in computer vision and robotics. In the work done so
far, facial expression analysis is performed either directly, by recognizing the facial expression itself, or indirectly, by first recognizing
Action Units (AUs) and then applying this information to facial expression analysis. The main challenges are associated with face detection
and tracking, facial feature extraction and facial feature classification. This review gives a brief timeline view of the research work
carried out on AU detection/estimation in static images and dynamic image sequences, and of the solutions proposed by researchers in this
field since 2002. In short, the paper aims to provide an impetus for addressing the challenges and applications of AU detection and for new
research topics that will increase productivity in this exciting and challenging field.
INTRODUCTION

The human face is a useful and powerful source for depicting human behaviour, as it is the most natural way for people to express their
emotions. Research shows that the verbal part of a message contributes only 7% of the effect of a spoken message as a whole, the vocal part
contributes 38%, and the facial expression of the speaker contributes 55%. This implies that facial expression plays the
most important role in human communication [1, 5]. Facial expression analysis, which is an important medium of Human-Computer
Interface (HCI), gained much attention after the development of technologies such as pattern recognition and artificial intelligence [1, 5].
Moreover, machine understanding of facial expressions can revolutionize HCI.

Facial expression analysis is generally accomplished in three steps: face detection (also called face acquisition), facial feature
extraction and facial expression classification. Face detection refers to localizing the face region in input images or image
sequences. Facial feature extraction obtains information about the encountered facial expression. Facial
expression classification assigns a label to the observed facial expression, and it can be done in two ways. In the direct approach, a
classifier recognizes the emotion directly. In the indirect approach, the Action Units (AUs) that cause the expression are recognized
first, and the facial expression is then recognized from them.

The Facial Action Coding System (FACS) is the best known and most commonly used sign judgement approach developed to describe
facial actions. It describes 44 AUs that provide information on all visually detectable face changes. A FACS coder decomposes a shown
expression into the specific AUs that describe it. In 1971, Ekman and Friesen defined six basic emotions: happiness, sadness, fear,
disgust, surprise and anger [1].

To make AU detection more robust and more efficient, both the frontal view and the profile view are considered. Some AUs, such as puckering
of the lips or pushing the jaw forward, are not clearly detectable in the frontal view but are clearly observable in the profile view, because these AUs
represent out-of-image-plane non-rigid face movements. Conversely, movement of the eyebrows and changes in the appearance of the eyes
cannot be detected in the profile view but are easily observable in the frontal view. Moreover, handling both views is one of the major steps towards establishing a
technological framework for automatic AU detection from multiple views of the face [12, 13].

Automatic recognition of AUs can be based on appearance-based features, geometric features or both. The appearance-based approach uses
example images of the face, called templates, to perform recognition and feature extraction. It includes selecting the best set of Gabor
wavelet filters using AdaBoost and training a Support Vector Machine (SVM) to classify AUs. The geometric-feature-based approach works
on the shape or deformation of facial features, such as the positions or velocities of facial fiducial points or the distances between these points.
It involves automatically detecting n facial points and using a facial point tracker based on particle filtering with
factorized likelihoods to track these points [8, 9, 10].

Current challenges in AU detection are face occlusion, glasses, facial hair and rigid head movements, all of which occur frequently in the
real world. Out-of-plane head movements also lead to self-occlusion of the face. As AUs are more localized in the face than expressions of
emotions, the problem of occlusion is much bigger for AUs than for emotions [11, 12]. The problem of rigid head movement can
be solved by using head-mounted cameras, which in turn reduce the subject's freedom of movement and make the setup uncomfortable for
the subject. Detection of the intensity of AUs and of their temporal phase transitions (onset, apex and offset) is still an area of concern. A
solution to this concern would enable detection of complex, higher-level behaviour such as deception, cognitive
states such as agreement (disagreement), and psychological states such as pain [12]. Moreover, the methods proposed so far are not able to
encode all 44 AUs defined in FACS simultaneously.

The presented survey focuses on the feature extraction and classification techniques that researchers have adopted for AU detection and
estimation. The rest of the paper is organised as follows. Section 2 deals with facial expression feature extraction and Section 3 gives a
brief idea of the classification techniques. Section 4 describes the various challenges and future scope in this area of research and
concludes the paper.

FEATURE EXTRACTION 

After the face has been detected in the observed scene, the next step is to extract information about the encountered facial expression.
Feature extraction depends on the kind of input image and on the applied face representation [1]. There are three types of face representation: holistic,
analytic and hybrid. In the holistic approach, the face is represented as a whole unit. In the analytic representation, the face is modelled as
a set of facial points or as a set of templates fitted to facial features such as the mouth and eyes. In the hybrid approach, the face is
represented by a combination of the analytic and holistic approaches, i.e. a set of facial points is used to determine the initial position of a
template that models the face. The major challenges in feature extraction are variation in the size and orientation of the face and
obscuring of the facial features by facial hair and glasses.

Two types of features are typically used to describe facial expression: geometric features and appearance features.
Based on these features, facial feature extraction techniques can be classified into the geometric-feature-based approach and the
appearance-feature-based approach [2].

Pantic et al. used particle filtering for feature extraction [13, 9, 8], an approach that directly detects the temporal segments of AUs.
They located and tracked a number of facial fiducial points and extracted temporal features from them. Auxiliary particle filtering was introduced by
Pitt and Shephard [26]. Particle filtering became the most widely used tracking technique because it deals successfully with noise, occlusion and
clutter. It has also been adapted to deal with colour-based template tracking and shadow problems [13]. The algorithm has three major
drawbacks: 1) a large number of the particles sampled from the proposal density may be wasted because they
propagate into areas with small likelihood; 2) a particle might have low likelihood overall while part of it is close to the correct solution; 3)
the estimation of particle weights does not consider the interdependence between the different parts of the state $\alpha$ (where $\alpha$ is the state
of a temporal event to be tracked) [10, 3].
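
The predict, weight and resample cycle that underlies this family of trackers can be sketched as follows. This is a generic bootstrap particle filter on a one-dimensional state, not the auxiliary or factorized variant used in the cited work; the random-walk motion model, the Gaussian likelihood, and all function names and parameter values are illustrative assumptions.

```python
import math
import random


def particle_filter_step(particles, observation, motion_std, obs_std, rng):
    """One predict/weight/resample cycle of a bootstrap particle filter (1-D state)."""
    # Predict: propagate every particle through an assumed random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: assumed Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # Every particle landed in a region of negligible likelihood (drawback 1 in the text);
        # fall back to uniform weights so the filter can continue.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # Estimate: weighted mean of the predicted particles.
    estimate = sum(w * p for w, p in zip(weights, predicted))
    # Resample: draw a new particle set in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def track(observations, n_particles=500, motion_std=1.0, obs_std=2.0, seed=0):
    """Track a scalar coordinate (e.g. one coordinate of a facial point) over a sequence."""
    rng = random.Random(seed)
    particles = [observations[0] + rng.gauss(0.0, obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        particles, estimate = particle_filter_step(particles, z, motion_std, obs_std, rng)
        estimates.append(estimate)
    return estimates
```

Calling `track` on a list of noisy measurements of a point coordinate returns one smoothed estimate per frame. Tracking each coordinate with a separate scalar filter, as here, ignores the interdependence between parts of the state, which is the third drawback listed above.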

Later, Patras and Pantic introduced particle filtering with factorized likelihoods (PFFL) [27] as an extension of auxiliary particle
filtering theory to address the afore-mentioned problems inherent in particle filtering. PFFL addresses the problem of
interdependencies between the different parts of the state $\alpha$ by assuming that $\alpha$ can be partitioned into sub-states $\alpha_i$ such
that $\alpha = \{\alpha_1, \ldots, \alpha_n\}$. The PFFL tracking system can be divided into two stages. In the first stage, each facial point i
is tracked independently of the other facial points, for each frame individually. In the second stage, the interdependence between the
sub-states is taken into account using a proposal distribution $g(\alpha)$, which is the product of the posteriors of each $\alpha_i$ [20, 3, 10, 25].
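
Written out, the factorization described above takes the following form. The conditioning variable $y$ (the image observations) is notation assumed here for illustration; the text itself only states that the proposal is a product of the per-part posteriors.

```latex
\alpha = \{\alpha_1, \ldots, \alpha_n\}, \qquad
g(\alpha) = \prod_{i=1}^{n} p(\alpha_i \mid y)
```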

Another popular geometric-feature-based approach is the Active Appearance Model (AAM) and its derivatives, which are used to
track a dense set of facial points. The locations of these points help to infer the facial features and their shapes in order to classify the facial
expression. Sung and Kim used Stereo Active Appearance Models (STAAMs) to track facial points in 3-D videos, as this improves the
fitting and tracking of standard AAMs by using multiple cameras to model the 3-D shape and all the rigid motion parameters.
The approach appears promising, but unfortunately no results on a benchmark database were presented [11].

The appearance-feature-based approach aims to capture skin motion and changes in facial texture due to wrinkles, furrows and
bulges. The main appearance-feature-based approaches are Gabor features and the family of LBP-based detectors (Local Binary Pattern (LBP),
Local Phase Quantization (LPQ)).

The Gabor wavelet is one of the best-known techniques for facial expression analysis. A Gabor function is defined by a Gaussian kernel
modulated by a sinusoidal plane wave. To extract texture information, a filter bank with different characteristic frequencies and orientations is
applied. The decomposition of an image is computed by filtering it with the filter bank; variants include
applying the Gabor filters to the difference image computed by subtracting the neutral expression of each sequence.
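
As a rough illustration of the definition above, the sketch below builds the real part of a Gabor kernel (a Gaussian envelope multiplied by a cosine carrier) and a small filter bank over several wavelengths and orientations. The parameterization (wavelength, orientation `theta`, envelope width `sigma`, aspect ratio `gamma`, phase `psi`) is one common convention and is an assumption here, as are the function names; the surveyed papers may use different settings.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a Gabor function sampled on a size x size grid (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size, wavelengths, n_orientations, sigma):
    """One kernel per (wavelength, orientation) pair, orientations evenly spaced in [0, pi)."""
    return [
        gabor_kernel(size, lam, k * math.pi / n_orientations, sigma)
        for lam in wavelengths
        for k in range(n_orientations)
    ]
```

Convolving an image with every kernel in the bank and taking the response magnitudes would give the kind of high-dimensional feature vector that later needs feature selection or down-sampling.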

Littlewort et al. [32] applied a Gabor filter bank to extract Gabor magnitudes from the whole face and then selected a subset of features
using AdaBoost. The outputs of the filters selected by AdaBoost were passed to a support vector machine for classification of seven
emotion expressions. Gabor features are applied not only for extracting features in the spatial domain but also in the temporal domain.
Bartlett et al. [4] applied Gabor features for simultaneous facial behaviour analysis. Ben Jemaa and Khanfir [15] used Gabor
coefficients for face recognition. In their work, geometric distances and Gabor coefficients are used independently or jointly, and a Gabor-jet vector
is used to characterize the face.

To reduce the dimension of Gabor features, the high-dimensional Gabor features can be uniformly down-sampled. It has been observed that
recognition performance is affected by the choice of fiducial points and the down-sampling factor, so an efficient encoding strategy
for Gabor outputs is needed. Gu et al. [21] extended the radial encoding strategy for Gabor outputs to radial grid encoding, leading to
high recognition accuracy. This method gives better results than the down-sampling method or methods involving Gabor jets.

LBP, introduced by Ojala et al. in [28], has proven to be one of the most powerful means of texture description. The basic operator
labels every pixel by thresholding its 3x3 neighbourhood against the value of the centre pixel. Ojala et al. later extended the basic LBP to a
gray-scale and rotation invariant texture operator that allows an arbitrary number of neighbours to be chosen at any distance from the central
pixel, based on a circularly symmetric neighbour set. The dimensionality of the LBP operator is reduced by introducing the concept of
uniform LBP. A uniform LBP contains at most two bitwise transitions from zero to one or vice versa when the binary string is considered
circular [11, 29].
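
The basic 3x3 operator and the uniformity test can be sketched in a few lines. The neighbour ordering, the `>=` thresholding convention and the function names are assumptions chosen for illustration; implementations differ on these details.

```python
def lbp_code(image, r, c):
    """8-bit LBP label of pixel (r, c), thresholding its 3x3 neighbourhood at the centre value."""
    center = image[r][c]
    # Neighbours visited in circular order, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def is_uniform(code, bits=8):
    """True if the circular bit string has at most two 0/1 transitions."""
    transitions = 0
    for i in range(bits):
        a = (code >> i) & 1
        b = (code >> ((i + 1) % bits)) & 1
        transitions += a != b
    return transitions <= 2


def lbp_histogram(image):
    """256-bin histogram of LBP labels over all interior pixels of a 2-D list image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

With 8 neighbours, only 58 of the 256 possible labels are uniform, which is why grouping all non-uniform labels into a single bin shortens the histogram considerably.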

The attractive features of LBP are its illumination tolerance and computational efficiency. The LPQ operator was originally proposed by
Ojansivu and Heikkila as a texture descriptor that is robust to image blurring. The descriptor uses a 2-D DFT or, more precisely, a
short-term Fourier transform (STFT) computed over an M-by-M neighbourhood to extract local phase information. In real
images, neighbouring pixels are highly correlated, leading to dependency between the Fourier coefficients that are quantized in
LPQ. Ojansivu et al. therefore introduced a de-correlation mechanism to improve LPQ, which is used by Jiang et al. in [29]. The LPQ descriptor
has been extended to the temporal domain: the basic LPQ features are extracted from three sets of orthogonal planes, XY, XT and YT, where
XY provides spatial domain information while the XT and YT planes provide temporal information. This extension is called Local Phase
Quantization from Three Orthogonal Planes (LPQ-TOP). Zhao et al. [14] applied the analogous LBP-TOP descriptor to recognition of the six basic
emotions and reported that it outperformed earlier approaches such as LBP and Gabor features.

FACIAL EXPRESSION CLASSIFICATION 

Facial feature extraction is followed by facial expression classification. The classifier labels the encountered expression as a
facial action, a basic emotion, or both. The classification relies on either a template-based or a spatial-based classification method.

The Support Vector Machine (SVM) is an excellent classifier in domains such as marine biology, face detection and speech recognition.
SVM classification proceeds as follows. The selected feature instances are divided into two sets: a training set and a testing set
[30]. An n-fold cross-validation loop is employed each time a classifier is trained, in order to search for the optimal parameters. While evaluating
each fold, the training data is split into five subsets; four of them are used to train a classifier and one is used to test it. SVM is
very well suited to the task of AU detection, as the high dimensionality of the feature space has no effect on the training time. SVM
classification can be summarized in three steps: 1) the margins of the hyperplane are maximized; 2) the input space is mapped to a linearly
separable feature space; 3) the ‘kernel trick’ is applied [20]. The most frequently used kernel functions are the linear, polynomial and
Radial Basis Function (RBF) kernels [29].
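
The three kernel functions named above and the fold-splitting step of cross-validation can be illustrated with the sketch below. It does not train an SVM; it only shows what the kernels compute and how sample indices could be partitioned into folds. The default values of `degree`, `coef0` and `gamma` and the function names are arbitrary assumptions.

```python
import math


def linear_kernel(x, z):
    """Plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """(x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold; folds are contiguous blocks."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size
```

With `n_folds=5`, each iteration trains on four subsets and tests on the remaining one, matching the split described in the text. In practice the samples would usually be shuffled, or grouped by subject, before splitting.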

The classification performance of an SVM decreases when the dimensionality of the feature set is far greater than the number of training
samples. This can be handled by decreasing the number of features used to train the SVM, for example by means of GentleBoost. Littlewort et al. [32] showed
that an SVM classifier trained on features selected by a boosting algorithm outperforms both the SVM and the boosting classifier applied directly.
SVM is used for shape information extraction in [24]. The combination of AdaBoost and SVM enhanced both the speed and the accuracy of the
system.

Valstar and Pantic [10, 25] proposed applying a hybrid SVM-HMM (successfully applied to speech recognition) to the problem of AU
temporal model detection. Valstar et al. [20] used a Probabilistic Actively learned Support Vector Machine (PAL-SVM) to reduce the
validation time in classifying the AUs displayed in a video. Simon et al. [31] proposed a segment-based SVM, k-seg-SVM, which is a
temporal extension of the spatial Bag-of-Words (BoW) approach and is trained with a Structured Output SVM (SO-SVM). Recent
research shows that SO-SVMs can outperform other algorithms, including HMMs and Max-Margin Markov Networks [31]. SO-SVMs have
several benefits in AU detection: 1) they model the dependencies between visual features and the duration of AUs; 2) they can be
trained effectively on all possible segments of the video; 3) no assumptions are made about the underlying structure of the AU; 4)
the negative examples that are most similar to the AU are selected explicitly.

The rule-based classification method, proposed by Pantic and Rothkrantz in 2000, classifies the facial expression into the basic emotions
based on previously encoded facial actions. Classification is performed by comparing the AU-coded description of the shown facial
expression with the AU-coded descriptions of the six basic emotional expressions. This classification method gave a recognition rate of
91%. To classify the observed changes in AUs and their temporal segments, these changes are transformed into a set of mid-level
parameters. Six mid-level parameters are defined to describe the change in position of the fiducial points. Two mid-level feature
parameters describe the motion of feature points: up/down and in/out (parameters calculated for profile contour fiducial
points). The parameter up/down denotes the upward or downward movement of a point P. The parameter in/out denotes the inward
or outward movement of a point P. Absent and inc/dec are two mid-level feature parameters used to denote the state of feature points.
Absent denotes the absence of a point P in the profile contour. Inc/dec denotes the increase or decrease in the distance between two
points. Finally, two mid-level feature parameters, angular and increased_curvature, describe two specific shapes formed between
certain feature points. The activation of each AU is divided into three segments: the onset (beginning), the apex and the offset (ending).
Inaccuracies in facial point tracking and occurrences of non-prototypic facial activity result in either unlabelled or incorrectly labelled
temporal segments. This can be handled by a memory-based process that takes into account the dynamics of facial expression, i.e. by
re-labelling the current frame/segment from the labels of the previous and next frames according to a rule-based system.
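
Two of the mid-level parameters and the neighbour-based re-labelling step can be sketched as below. The image-coordinate convention (y grows downward), the tolerance `tol`, the label strings and the specific re-labelling rule (overwrite a frame whose two neighbours agree with each other but not with it) are all assumptions made for illustration; the source does not spell out the actual rules.

```python
import math


def up_down(p_neutral, p_current, tol=1.0):
    """'up', 'down' or 'none' for point P, assuming image coordinates with y growing downward."""
    dy = p_current[1] - p_neutral[1]
    if dy < -tol:
        return "up"
    if dy > tol:
        return "down"
    return "none"


def inc_dec(p_neutral, q_neutral, p_current, q_current, tol=1.0):
    """'inc', 'dec' or 'none' for the distance between points P and Q relative to the neutral frame."""
    d_neutral = math.hypot(p_neutral[0] - q_neutral[0], p_neutral[1] - q_neutral[1])
    d_current = math.hypot(p_current[0] - q_current[0], p_current[1] - q_current[1])
    if d_current - d_neutral > tol:
        return "inc"
    if d_neutral - d_current > tol:
        return "dec"
    return "none"


def relabel(labels):
    """Hypothetical memory-based rule: a frame takes its neighbours' label when both neighbours agree."""
    fixed = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            fixed[i] = labels[i - 1]
    return fixed
```

For example, a single unlabelled frame between two apex frames would be re-labelled as apex, while a frame at a genuine segment boundary, where the neighbours disagree, is left untouched.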

Koelstra and Pantic [6, 7] used a GentleBoost classifier on motion from a non-rigid registration, combined with an HMM. GentleBoost
converges faster than AdaBoost and is more reliable in terms of stability. It is used in [22] for feature selection in order to reduce the
dimensionality of the feature vectors before classification. The GentleBoost algorithm selects a linear combination of features one at a
time until the addition of features no longer improves the classification, thus giving a reasonable balance between speed and
complexity. For some AUs the spatial magnitude projection information is not sufficient, and temporal domain analysis is needed for AU
classification. Each onset/offset GentleBoost classifier returns a single number per frame, which depicts the confidence that the frame
shows the target AU and the target temporal segment. To combine the onset/offset GentleBoost classifiers into a single AU recognizer, a
continuous HMM is used. The HMM uses knowledge of the prior probabilities of each temporal segment and of its duration, derived from the
training set. The HMM provides a degree of temporal filtering and smooths the results of the GentleBoost classifiers. However, this
captures the temporal dynamics only to a limited degree. This issue can be addressed by using an HMM with a state duration model.
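
The feature-at-a-time behaviour of GentleBoost can be illustrated with the sketch below, which uses regression stumps on single features as weak learners and runs for a fixed number of rounds. The stump form, the fixed round count (rather than stopping when accuracy no longer improves) and all names are assumptions; this is not the exact configuration of the cited systems.

```python
import math


def fit_stump(X, y, w):
    """Weighted least-squares regression stump: returns (feature, threshold, value_above, value_below)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            above = [i for i, row in enumerate(X) if row[j] > thr]
            below = [i for i, row in enumerate(X) if row[j] <= thr]
            w_above = sum(w[i] for i in above)
            w_below = sum(w[i] for i in below)
            # The weighted mean of the labels on each side minimizes weighted squared error.
            a = sum(w[i] * y[i] for i in above) / w_above if w_above > 0 else 0.0
            b = sum(w[i] * y[i] for i in below) / w_below if w_below > 0 else 0.0
            err = sum(w[i] * (y[i] - a) ** 2 for i in above)
            err += sum(w[i] * (y[i] - b) ** 2 for i in below)
            if best is None or err < best[0]:
                best = (err, j, thr, a, b)
    return best[1:]


def stump_predict(stump, x):
    j, thr, a, b = stump
    return a if x[j] > thr else b


def gentleboost(X, y, n_rounds):
    """Labels y must be -1 or +1. Returns the list of selected stumps."""
    n = len(X)
    w = [1.0 / n] * n
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        stumps.append(stump)
        # Gentle update: reweight by exp(-y * f(x)), then renormalize.
        w = [wi * math.exp(-yi * stump_predict(stump, xi)) for wi, yi, xi in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return stumps


def confidence(stumps, x):
    """Sum of stump outputs; the sign is the class and the magnitude a per-frame confidence."""
    return sum(stump_predict(s, x) for s in stumps)
```

The set of feature indices appearing in the returned stumps is the selected feature subset, and `confidence` plays the role of the single per-frame number that is then passed on to the HMM.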

Cohen et al. [33, 34] exploited existing methods and proposed a new architecture of Hidden Markov Models (HMMs), in which
segmentation and recognition of facial expressions are done automatically. HMMs are most commonly used in speech recognition, as
they are able to model non-stationary signals and events. In this approach all the stored sequences are used to find the best match, and
hence it is quite time consuming. An HMM uses the transition probabilities between the hidden states and learns the
conditional probabilities of the observations given the state of the model. The two main model structures used to model expressions
are the left-to-right model and the ergodic model. Left-to-right models involve fewer parameters and are thus easier to train. However, they reduce the
degrees of freedom of the observation sequence for the model. An ergodic HMM allows more freedom for the model. The main problem
with this approach is that it works on isolated or pre-segmented facial expression sequences, which are not available in reality. This
problem is solved using a multi-level HMM classifier. Here, motion features are fed to the emotion-specific HMMs, then the state
sequence is decoded using the Viterbi algorithm and used as the observation vector for the high-level HMM (consisting of seven states, one for
each of the six emotions and one for neutral).
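
Viterbi decoding, which recovers the most likely hidden state sequence from a sequence of observations, can be sketched as follows. The implementation works in log space over discrete observation symbols; the function signature and the dictionary-based model format are assumptions made for this illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path. Probabilities are nested dicts; missing entries count as 0."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    first = observations[0]
    scores = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(first, 0.0)) for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointer = {}, {}
        for s in states:
            # Best predecessor of state s under the transition model.
            best_prev = max(states, key=lambda r: prev[r] + log(trans_p[r].get(s, 0.0)))
            current[s] = (prev[best_prev] + log(trans_p[best_prev].get(s, 0.0))
                          + log(emit_p[s].get(obs, 0.0)))
            pointer[s] = best_prev
        scores.append(current)
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path
```

In a left-to-right model, the transition dictionary of each state would list only the state itself and its successor, so that every other transition has probability zero and is excluded from the decoded path.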

Ben Jemaa and Khanfir used non-linear neural networks for classification [15]. The advantage of using neural networks for
recognition and classification is the feasibility of training a system under complex conditions such as rotation and lighting changes. The neural
network architecture, such as the number of layers and nodes, has to be varied to get good performance. In this study, three types of features
are used, namely 1) geometric distances between fiducial points, 2) Gabor coefficients, and 3) combined information about Gabor
coefficients and geometric distances. A preliminary version of a two-stage classifier combining a kNN-based and a rule-based classifier
was first presented by Valstar et al. in [17] and later used in [18]. Applying kNN alone resulted in recognition rates that were lower
than expected. It was observed that some of the mistakes made by the classifier were deterministic and could be corrected using a set
of rules based on the knowledge of human FACS coders.
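
The two-stage idea can be sketched as a kNN vote followed by a pass through correction rules. The rule interface (a function taking the feature vector and the current label and returning a possibly corrected label) is hypothetical, and no actual FACS-derived rules are reproduced here.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training samples closest to x."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: squared_distance(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def two_stage_predict(train_X, train_y, x, rules, k=3):
    """Stage 1: kNN label. Stage 2: each rule may overwrite the label (hypothetical interface)."""
    label = knn_predict(train_X, train_y, x, k)
    for rule in rules:
        label = rule(x, label)
    return label
```

Each rule encodes one systematic mistake of the first stage, so an empty rule list reduces the classifier to plain kNN.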

CONCLUSION AND FUTURE SCOPE 

Facial expression analysis is an intriguing problem and a need for the future because of its utility in the domains of HCI and human
behaviour interpretation. A lot of research on topics such as feature extraction, classification and pain detection has been done in this area, and
results as high as 94.70% [2], 99.98% [15] and 94.3% [7] have been attained. However, some areas still need to be explored and can serve as a base
for future research. The AUs for basic emotions have been detected by many researchers using different extraction and classification
techniques, but the AUs defining complex emotions have not been described very clearly in the research so far. Further research can be
done on the co-occurrence and interdependence of AUs and on the detection of AUs for deception, as the complex emotions
they generate occur more frequently in real life than basic emotions. Facial features have been extracted and classified in
static images and video, but real-time application has not been explored so far.